Community structure in social and biological networks 
A number of recent studies have focused on the statistical properties of networked systems such 
as social networks and the World-Wide Web. Researchers have concentrated particularly on a 
few properties which seem to be common to many networks: the small-world property, power-law 
degree distributions, and network transitivity. In this paper, we highlight another property which is 
found in many networks, the property of community structure, in which network nodes are joined 
together in tightly-knit groups between which there are only looser connections. We propose a new 
method for detecting such communities, built around the idea of using centrality indices to find 
community boundaries. We test our method on computer-generated and real-world graphs whose 
community structure is already known, and find that it detects this known structure with high 
sensitivity and reliability. We also apply the method to two networks whose community structure is 
not well known (a collaboration network and a food web) and find that it detects significant and 
informative community divisions in both cases. 



I. INTRODUCTION 

Many systems take the form of networks, sets of 
nodes or vertices joined together in pairs by links or 
edges. Examples include social networks such as 
acquaintance networks and collaboration networks, 
technological networks such as the Internet, the 
World-Wide Web, and power grids, and biological 
networks such as neural networks, food webs, and 
metabolic networks. Recent research on networks among 
mathematicians and physicists has focused on a number 
of distinctive statistical properties that most 
networks seem to share. One such property is the 
"small world effect," which is the name given to the 
finding that the average distance between vertices in 
a network is short, usually scaling logarithmically 
with the total number n of vertices. Another is the 
right-skewed degree distributions that many networks 
possess. The degree of a vertex in a network is the 
number of other vertices to which it is connected, and 
one finds that there are typically many vertices in a 
network with low degree and a small number with high 
degree, the precise distribution often following a 
power-law or exponential form. 

A third property that many networks have in common 
is clustering, or network transitivity, which is the prop- 
erty that two vertices that are both neighbors of the same 
third vertex have a heightened probability of also being 
neighbors of one another. In the language of social net- 
works, two of your friends will have a greater probability 
of knowing one another than will two people chosen at 
random from the population, on account of their com- 
mon acquaintance with you. This effect is quantified by 
the clustering coefficient C, defined by 

$$ C = \frac{3 \times (\text{number of triangles on the graph})}{(\text{number of connected triples of vertices})} $$

This number is precisely the probability that two of one's 
friends are friends themselves. It is 1 on a fully connected 
graph (everyone knows everyone else) and has typical values 
in the range 0.1 to 0.5 in many real-world networks. 

In this paper, we consider another property which, as 
we will show, appears to be common to many networks, 
the property of community structure. (This property is 
also sometimes called clustering, but we refrain from this 
usage to avoid confusion with the other meaning of the 
word clustering introduced in the preceding paragraph.) 
Consider for a moment the case of social networks, that is, 
networks of friendships or other acquaintances between 
individuals. It is a matter of common experience that such 
networks seem to have communities in them: subsets 
of vertices within which vertex-vertex connections are 
dense, but between which connections are less dense. A 
figurative sketch of a network with such a community 
structure is shown in Fig. 1. (Certainly it is possible that 
the communities themselves also join together to form 
meta-communities, and that those meta-communities are 
themselves joined together, and so on in a hierarchical 
fashion. This idea is discussed further in a later section.) 

FIG. 1: A schematic representation of a network with 
community structure. In this network there are three 
communities of densely connected vertices (circles with solid 
lines), with a much lower density of connections (gray lines) 
between them. 

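
The clustering coefficient C defined earlier in this section can be checked by hand on a small graph. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the function name `clustering_coefficient` and the example graph are illustrative and do not come from the original study.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """C = 3 * (number of triangles) / (number of connected triples).

    adj maps each vertex to the set of its neighbors (undirected graph).
    """
    closed = 0   # connected triples whose two end vertices are also linked
    triples = 0  # all connected triples, counted once per center vertex
    for center, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    # Each triangle closes three triples (one per center vertex),
    # so `closed` already equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 on vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficient(adj))
```

In this example there is one triangle and five connected triples, so C = 3/5; on a fully connected graph every triple is closed and the function returns 1.
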
The ability to detect community structure in a network 
could clearly have practical applications. Communities 
in a social network might represent real social groupings, 
perhaps by interest or background; communities in a 
citation network might represent related papers on a 
single topic; communities in a metabolic network might 
represent cycles and other functional groupings; commu- 
nities in the Web might represent pages on related topics. 
Being able to identify these communities could help us to 
understand and exploit these networks more effectively. 

In this paper we propose a new method for detecting 
community structure and apply it to the study of a num- 
ber of different social and biological networks. As we 
will show, when applied to networks for which the com- 
munity structure is already known from other studies, 
our method appears to give excellent agreement with the 
expected results. When applied to networks for which 
we do not have other information about communities, 
it gives promising results which may help us understand 
better the interplay between network structure and func- 
tion. 



II. DETECTING COMMUNITY STRUCTURE 

In this section we review existing methods for detecting 
community structure and discuss the ways in which these 
approaches may fail, before describing our own method, 
which avoids some of the shortcomings of the traditional 
techniques. 



A. Traditional methods 

The traditional method for detecting community 
structure in networks such as that depicted in Fig. 1 is 
hierarchical clustering. One first calculates a weight W_ij 
for every pair i, j of vertices in the network, which 
represents in some sense how closely connected the vertices 
are. (We give some examples of possible such weights 
below.) Then one takes the n vertices in the network, 
with no edges between them, and adds edges between 
pairs one by one in order of their weights, starting with 
the pair with the strongest weight and progressing to the 
weakest. As edges are added, the resulting graph shows 
a nested set of increasingly large components (connected 
subsets of vertices), which are taken to be the commu- 
nities. Since the components are properly nested, they 
can all be represented using a tree of the type shown 
in Fig. 2, in which the lowest level at which two vertices 
are connected represents the strength of the edge 
which resulted in their first becoming members of the 
same community. A "slice" through this tree at any level 
gives the communities which existed just before an edge 
of the corresponding weight was added. Trees of this type 
are sometimes called "dendrograms" in the sociological 
literature. 

FIG. 2: An example of a small hierarchical clustering tree. 
The circles at the bottom of the figure represent the vertices 
in the network and the tree shows the order in which they 
join together to form communities for a given definition of 
the weight W_ij of connections between vertex pairs. 
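
The hierarchical clustering procedure described above (start from n isolated vertices and add pairs in order of decreasing weight) can be sketched as follows. This is an illustrative sketch only: the function name `hierarchical_clustering`, the dictionary of pair weights, and the use of a union-find structure to track components are assumptions of the example, not details given in the text.

```python
def hierarchical_clustering(n, weights):
    """Agglomerate n vertices by adding pairs in order of decreasing weight.

    weights maps a pair (i, j) to its weight W_ij. Returns the merges as
    (weight, component_a, component_b) tuples, strongest first; read in
    order, they describe the tree (dendrogram) from the leaves upward.
    """
    parent = list(range(n))
    members = {v: [v] for v in range(n)}

    def find(v):
        # Follow parent pointers to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # the pair is already in the same component
        merges.append((w, sorted(members[ri]), sorted(members[rj])))
        parent[rj] = ri
        members[ri] += members.pop(rj)
    return merges

# Hypothetical weights for four vertices.
example = {(0, 1): 5.0, (2, 3): 4.0, (1, 2): 1.0, (0, 2): 0.5}
for weight, a, b in hierarchical_clustering(4, example):
    print(weight, a, b)
```

Slicing the tree at a given weight corresponds to stopping the loop before the first merge whose weight falls below that level.
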



Many different weights have been proposed for use with 
hierarchical clustering algorithms. One possible defini- 
tion of the weight is the number of node-independent 
paths between vertices. Two paths which connect the 
same pair of vertices are said to be node-independent if 
they share none of the same vertices other than their 
initial and final vertices. It is known that the number 
of node-independent paths between vertices i and j in a 
graph is equal to the minimum number of vertices that 
need be removed from the graph in order to disconnect i 
and j from one another. Thus this number is in a sense 
a measure of the robustness of the network to deletion of 
nodes. 

Another possible way to define weights between 
vertices is to count the total number of paths that run 
between them (all paths, not just node-independent ones). 
However, since the number of paths between any two 
vertices is infinite (unless it is zero), one typically weights 
paths of length ℓ by a factor α^ℓ with α small, so that the 
weighted count of the number of paths converges. 
Thus long paths contribute exponentially less weight 
than short ones. If A is the adjacency matrix of the 
network, such that A_ij is 1 if there is an edge between 
vertices i and j and 0 otherwise, then the weights in this 
definition are given by the elements of the matrix 

$$ W = \sum_{\ell=0}^{\infty} (\alpha A)^{\ell} = [I - \alpha A]^{-1}. $$

In order for the sum to converge, we must choose α 
smaller than the reciprocal of the largest eigenvalue of A. 
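
As a rough numerical illustration of the weighted path count, the sketch below sums the series of powers of αA directly, truncating after a fixed number of terms. The function names `matmul` and `path_weights` and the `terms` cutoff are assumptions of this example; in practice one would more likely invert I − αA with a linear algebra library.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_weights(A, alpha, terms=200):
    """Approximate W = sum over l >= 0 of (alpha * A)^l by a truncated sum.

    The sum converges only if alpha is smaller than the reciprocal of the
    largest eigenvalue of A; `terms` is a hypothetical truncation length.
    """
    n = len(A)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    scaled = [[alpha * a for a in row] for row in A]
    W = [row[:] for row in identity]      # the l = 0 term
    power = [row[:] for row in identity]  # holds (alpha * A)^l
    for _ in range(terms):
        power = matmul(power, scaled)
        W = [[W[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return W

# Three vertices in a line, 0 - 1 - 2. The largest eigenvalue of A is
# sqrt(2), so alpha = 0.5 lies inside the region of convergence.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = path_weights(A, 0.5)
print([round(x, 6) for x in W[0]])
```

For this three-vertex line the closed form [I − αA]^{-1} gives W_01 = 1 and W_02 = 1/2, so the adjacent pair receives the larger weight, as one would expect.
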

Both of these definitions of the weights give reasonable 
results for community structure in some cases. In other 
cases they are less successful. In particular, both have a 
tendency to separate single peripheral vertices from the 
communities to which they should rightly belong. If a 
vertex is, for example, connected to the rest of a network 
by only a single edge then, to the extent that it belongs to 
any community, it should clearly be considered to belong 
to the community at the other end of that edge. Unfortu- 
nately, both the numbers of node-independent paths and 
the weighted path counts for such vertices are small and 
hence single nodes often remain isolated from the network 
when the communities are constructed. This and other 
pathologies, along with poor results from these methods 
in some networks where the community structure is well 
known from other studies, make the hierarchical cluster- 
ing method, although useful, far from perfect. 

B. Edge betweenness and community structure 

To sidestep the shortcomings of the hierarchical clus- 
tering method, we here propose a new approach to the 
detection of communities. Instead of trying to construct 
a measure which tells us which edges are most central 
to communities, we focus instead on those edges which 
are least central, the edges which are most "between" 
communities. Rather than constructing communities by 
adding the strongest edges to an initially empty vertex 
set, we construct them by progressively removing edges 
from the original graph. 

Vertex "betweenness" has been studied in the past as 
a measure of the centrality and influence of nodes in 
networks. First proposed by Freeman, the betweenness 
centrality of a vertex i is defined as the number of 
shortest paths between pairs of other vertices which run 
through i. It is a measure of the influence of a node over 
the flow of information between other nodes, especially 
in cases where information flow over a network primarily 
follows the shortest available path. 

In order to find which edges in a network are most "be- 
tween" other pairs of vertices, we generalize Freeman's 
betweenness centrality to edges and define the edge be- 
tweenness of an edge as the number of shortest paths 
between pairs of vertices that run along it. If there is 
more than one shortest path between a pair of vertices, 
each path is given equal weight such that the total weight 
of all the paths is unity. If a network contains commu- 
nities or groups that are only loosely connected by a few 
inter-group edges, then all shortest paths between differ- 
ent communities must go along one of these few edges. 
Thus, the edges connecting communities will have high 
edge betweenness. By removing these edges, we separate 
groups from one another and so reveal the underlying 
community structure of the graph. 

The algorithm we propose for identifying communities 
is simply stated as follows: 

1. Calculate the betweenness for all edges in the net- 
work. 

2. Remove the edge with the highest betweenness. 

3. Recalculate betweennesses for all edges affected by 
the removal. 

4. Repeat from step 2 until no edges remain. 
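
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results in this paper: the graph is undirected, unweighted, and free of self-loops; betweenness is accumulated by breadth-first search from every vertex; and, for simplicity, every edge is recalculated after each removal rather than only the affected ones. The names `edge_betweenness`, `components`, and `girvan_newman` are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge; adj maps vertex -> neighbor set.

    When a pair of vertices has several shortest paths, each path gets an
    equal share so that the pair contributes a total weight of one.
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):           # accumulate from the farthest vertices
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted once from each of its endpoints.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components of the graph, as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman(adj):
    """Yield the components each time an edge removal splits the graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    n_parts = len(components(adj))
    while any(adj.values()):
        scores = edge_betweenness(adj)      # steps 1 and 3 (full recalculation)
        u, v = max(scores, key=scores.get)  # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > n_parts:
            n_parts = len(parts)
            yield parts

# Hypothetical example: two triangles joined by the single edge 2-3.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(next(girvan_newman(g)))
```

In the example, all nine shortest paths between the two triangles run along the edge 2-3, so it has the highest betweenness and is removed first, which separates the two triangles. The successive splits yielded by the generator trace out the hierarchical tree from the top down.
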

As a practical matter, we calculate the betweennesses 
using the fast algorithm of Newman, which calculates 
betweenness for all m edges in a graph of n vertices 
in time O(mn). Since this calculation has to be repeated 
once for the removal of each edge, the entire algorithm 
runs in worst-case time O(m^2 n). However, following the 
removal of each edge, we only have to recalculate the 
betweennesses of those edges that were affected by the 
removal, which is at most only those in the same 
component as the removed edge. This means that running time 
may be better than worst-case for networks with strong 
community structure (ones which rapidly break up into 
separate components after the first few iterations of the 
algorithm). 
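
The worst-case bound quoted above follows from a short count, written out here as a sketch. The second step assumes a sparse graph, meaning one in which the number of edges m grows in proportion to the number of vertices n.

```latex
T_{\text{worst}}
  \;=\; \underbrace{m}_{\text{edge removals}}
  \times \underbrace{O(mn)}_{\text{betweenness of all edges}}
  \;=\; O(m^{2} n),
\qquad
m \propto n \;\Longrightarrow\; T_{\text{worst}} = O(n^{3}).
```
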

To try to reduce the running time of the algorithm 
further, one might be tempted to calculate the between- 
nesses of all edges only once and then remove them in 
order of decreasing betweenness. We find however that 
this strategy does not work well, because if two commu- 
nities are connected by more than one edge, then there 
is no guarantee that all of those edges will have high 
betweenness — we only know that at least one of them 
will. By recalculating betweennesses after the removal of 
each edge we ensure that at least one of the remaining 
edges between two communities will always have a high 
value. 



III. TESTS OF THE METHOD 

In this section we present a number of tests of our algo- 
rithm on computer-generated graphs and on real-world 
networks for which the community structure is already 
known. In each case we find that our algorithm reliably 
detects the known structure. 



A. Computer-generated graphs 

To test the performance of our algorithm on networks 
with varying degrees of community structure, we have 
applied it to a large set of artificial, computer-generated 
graphs similar to those depicted in Fig. 1. Each graph 
was constructed with 128 vertices, each of which was 
connected to exactly z = 16 others. The vertices were 
divided into four separate communities with some 
number z_in of each vertex's 16 connections made to randomly 
chosen members of its own community and the remaining 
z_out = z - z_in made to random members of other 
communities. This produces graphs which have known 
community structure, but which are essentially random in 
other respects. Using these graphs, we tested the 
performance of our algorithm as the ratio of intra-community 
to inter-community connections was varied. The results 
are shown in Fig. 3. As we can see, the algorithm 
performs near perfectly when z_out < 6, classifying virtually 
100% of vertices into their correct communities. Only 
for z_out > 6 does the fraction correctly classified start 
to fall off. In other words the algorithm performs 
perfectly almost to the point at which each vertex has as 
many inter-community connections as intra-community 
ones. This is an encouraging first result for the method. 
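
Test graphs of this general kind can be generated along the following lines. The sketch rests on a simplifying assumption that differs from the construction described above: edges are placed independently at random, so each vertex has z = 16 connections only on average rather than exactly. The function name `planted_graph` and its parameters are hypothetical.

```python
import random

def planted_graph(z_out, z=16, n=128, groups=4, seed=0):
    """Random graph with `groups` equal communities and mean degree about z.

    Each vertex has on average z_out neighbors outside its own community
    and z - z_out inside it. Degrees equal z only on average, which is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    size = n // groups                   # 32 vertices per community
    community = [v // size for v in range(n)]
    p_in = (z - z_out) / (size - 1)      # 31 possible partners inside
    p_out = z_out / (n - size)           # 96 possible partners outside
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj, community

adj, community = planted_graph(z_out=4)
mean_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print(round(mean_degree, 2))
```

The known assignment in `community` can then be compared with the groups returned by a community-detection routine to estimate the fraction of vertices classified correctly.
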



FIG. 3: The fraction of vertices correctly classified by our 
method as the number z_out of inter-community edges per 
vertex is varied, for computer-generated graphs of the type 
described in the text. The measurements with half-integer 
values z_out = k + 1/2 are for graphs in which half the vertices 
had k inter-community connections and half had k + 1. Each 
point is an average over 100 realizations of the graphs. Lines 
between points are included solely as a guide to the eye. 



B. Zachary's karate club study 

While computer-generated networks provide a 
reproducible and well-controlled test-bed for our 
community-structure algorithm, it is clearly desirable to test 
the algorithm on data from real-world networks as well. To 
this end, we have selected two datasets representing 
real-world networks for which the community structure is 
already known from other sources. The first of these 
is drawn from the well-known "karate club" study of 
Zachary. In this study, Zachary observed 34 members 
of a karate club over a period of two years. During 
the course of the study, a disagreement developed 
between the administrator of the club and the club's 
instructor, which ultimately resulted in the instructor's 
leaving and starting a new club, taking about half of 
the original club's members with him. 

Zachary constructed a network of friendships between 
members of the club, using a variety of measures to 
estimate the strength of ties between individuals. Here 
we use a simple unweighted version of his network and 
apply our algorithm to it in an attempt to identify the 
factions involved in the split of the club. Figure 4a shows 
the network, with the instructor and the administrator 
represented by nodes 1 and 34, respectively. Figure 4b 
shows the hierarchical tree of communities produced by 
our method. The most fundamental split in the network 
is the first one at the top of the tree, which divides the 
network into two groups of roughly equal size. This split 
corresponds almost perfectly with the actual division of 
the club members following the break-up, as revealed by 
which club they attended afterwards. Only one node, 
node 3, is classified incorrectly. In other words, the 
application of our algorithm to the empirically observed 
network of friendships is a good predictor of the 
subsequent social evolution of the group. 

FIG. 4: (a) The friendship network from Zachary's karate club 
study, as described in the text. Nodes associated with the 
club administrator's faction are drawn as circles, while those 
associated with the instructor's faction are drawn as squares. 
(b) The hierarchical tree showing the complete community 
structure for the network. The initial split of the network into 
two groups is in agreement with the actual factions observed 
by Zachary, with the exception that node 3 is misclassified. 



C. College football 

As a further test of our algorithm, we turn to the world 
of US college football. ("Football" here means Amer- 
ican football, not soccer.) The network we look at is 
a representation of the schedule of Division I games for 
the 2000 season: vertices in the graph represent teams 
(identified by their college names) and edges represent 
regular season games between the two teams they con- 
nect. What makes this network interesting is that it in- 
corporates a known community structure. The teams 
are divided into "conferences" containing around 8 to 12 
teams each. Games are more frequent between members 
of the same conference than between members of differ- 
ent conferences, with teams playing an average of about 
7 intra-conference games and 4 inter-conference games 
in the 2000 season. Inter-conference play is not 
uniformly distributed; teams that are geographically close 
to one another but belong to different conferences are 
more likely to play one another than teams separated by 
large geographic distances. 

Applying our algorithm to this network, we find that it 
identifies the conference structure with a high degree of 
success. Almost all teams are correctly grouped with the 
other teams in their conference. There are a few indepen- 
dent teams that do not belong to any conference — these 
tend to be grouped with the conference with which they 
are most closely associated. The few cases in which the 
algorithm seems to fail actually correspond to nuances 
in the scheduling of games. For example, the Sunbelt 
conference is broken into two pieces and grouped with 
members of the Western Athletic conference. This hap- 
pens because the Sunbelt teams played nearly as many 
games against Western Athletic teams as they did against 
teams in their own conference. Naturally, our algorithm 
fails in cases like this where the network structure gen- 
uinely does not correspond to the conference structure. 
In all other respects however it performs remarkably well. 



IV. APPLICATIONS 

In the previous section we tested our algorithm on a 
number of networks for which the community structure 
was known beforehand. The results indicate that our al- 
gorithm is a sensitive and accurate method for extracting 
community structure from both real and artificial net- 
works. In this section, we apply our method to two more 
networks for which the structure is not known, and show 
that in these cases it can help us to understand the make- 
up of otherwise complex and tangled datasets. Our first 
example is a collaboration network of scientists; our sec- 
ond is a food web of marine organisms in the Chesapeake 
Bay. 



A. Collaboration network 

We have applied our community-finding method to a 
collaboration network of scientists at the Santa Fe In- 
stitute, an interdisciplinary research center in Santa Fe, 
New Mexico (and current academic home to both the 
authors of this paper). The 271 vertices in this network 
represent scientists in residence at the Santa Fe Insti- 
tute during any part of calendar year 1999 or 2000, and 
their collaborators. An edge is drawn between a pair of 
scientists if they coauthored one or more articles during 
the same time period. The network includes all journal 
and book publications by the scientists involved, along 
with all papers that appeared in the institute's techni- 
cal reports series. On average, each scientist coauthored 
articles with approximately five others. 

FIG. 5: Hierarchical tree for the network reflecting the 
schedule of regular season Division I college football games 
for year 2000. Nodes in the network represent teams and edges 
represent games between teams. Our algorithm identifies nearly 
all the conference structure in the network. 

FIG. 6: The largest component of the Santa Fe Institute 
collaboration network, with the primary divisions detected by 
our algorithm represented by different vertex shapes. 

FIG. 7: Hierarchical tree for the Chesapeake Bay food web 
described in the text. 

In Fig. 6 we illustrate the results from the application 
of our algorithm to the largest component of the 
collaboration graph (which consists of 118 scientists). Vertices 
are drawn as different shapes according to the primary 
divisions detected. We find that the algorithm splits the 
network into a few strong communities, with the divisions 
running principally along disciplinary lines. The com- 
munity at the top of the figure (diamonds) is the least 
well defined, and represents a group of scientists using 
agent-based models to study problems in economics and 
traffic flow. The algorithm further divides this group 
into smaller components that correspond roughly with 
the split between economics and traffic. The next com- 
munity (circles) represents a group of scientists working 
on mathematical models in ecology, and forms a fairly 
cohesive structure, as evidenced by the fact that the al- 
gorithm does not break it into smaller components to any 
significant extent. The largest community (represented 
by the squares) is a group working primarily in statisti- 
cal physics, and is sub-divided into several well-defined 
smaller groups which are denoted by the various shad- 
ings. In this case, each sub-community seems to revolve 
around the research interests of one dominant member. 
The final community at the bottom of the figure (tri- 
angles) is a group working primarily on the structure 
of RNA. It too can be divided further into smaller sub- 
communities, centered once again around the interests of 
leading members. 

Our algorithm thus seems to find two types of commu- 
nities: scientists grouped together by similarity either of 
research topic or of methodology. It is not surprising to 
see communities built around research topics; we expect 
scientists to collaborate primarily with others with whom 
their research focus is closely aligned. The formation of 
communities around methodologies is more interesting, 
and may be the mark of truly interdisciplinary work. 
For example, the grouping of those working on economics 
with those working on traffic models may seem surpris- 
ing, until one realizes that the technical approaches these 
scientists have taken are quite similar. As a result of 
these kinds of similarities, the network contains ties be- 
tween researchers from traditionally disparate fields. We 
conjecture that this feature may be peculiar to interdis- 
ciplinary centers like the Santa Fe Institute. 



B. Food web 

We have also applied our algorithm to a food web of 
marine organisms living in the Chesapeake Bay, a large 
estuary on the east coast of the United States. This 
network was originally compiled by Baird and Ulanowicz 
and contains 33 vertices representing the ecosystem's 
most prominent taxa. Most taxa are represented 
at the species or genus level, although some vertices rep- 
resent groups of related species. Edges between taxa in- 
dicate trophic relationships — one taxon feeding on an- 
other. Although relationships of this kind are inherently 
directed, we here ignore direction and consider the net- 
work to be undirected. 

Applying our algorithm to this network, we find 
two well-defined communities of roughly equal size, 
plus a small number of vertices that belong to neither 
community (see Fig. 7). As the figure shows, the split 
between the two large communities corresponds quite 
closely with the division between pelagic organisms (ones 
that dwell principally near the surface or in the middle 
depths of the bay) and benthic organisms (ones that dwell 
near the bottom). Interestingly, the algorithm includes 
within each group organisms from a variety of different 
trophic levels. This contrasts with other techniques 
that have been used to analyze food webs, which 
tend to cluster taxa according to trophic level rather 
than habitat. Our results seem to imply that pelagic 
and benthic organisms in the Chesapeake Bay can be 
separated into reasonably self-contained ecological sub- 
systems. The separation is not perfect: a small number of 
benthic organisms find their way into the pelagic commu- 
nity, presumably indicating that these species play a sub- 
stantial role in the food chains of their surface-dwelling 
colleagues. This suggests that the simple traditional di- 
vision of taxa into pelagic or benthic may not be an ideal 
classification in this case. 

We have also applied our method to a number of other 
food webs. Interestingly, while some of these show clear 
community structure similar to that of Fig. 7, some 
others do not. This could be because some ecosystems are 
genuinely not composed of separate communities, but it 
could also be because many food webs, unlike other 
networks, are dense, i.e., the number of edges scales as the 
square of the number of vertices rather than scaling 
linearly. Our algorithm was designed with sparse 
networks in mind, and it is possible that it may not perform 
as well on dense networks. 



V. CONCLUSIONS 

In this paper we have investigated community struc- 
ture in networks of various kinds, introducing a new 
method for detecting such structure. Unlike previous 
methods which focus on finding the strongly connected 
cores of communities, our approach works by using in- 
formation about edge betweenness to detect community 
peripheries. We have tested our method on 
computer-generated graphs and have shown that it detects the 
known community structure with a high degree of 
success. We have also tested it on two real-world networks 
with well-documented structure and find the results to 
be in excellent agreement with expectations. In addi- 
tion, we have given two examples of applications of the 
algorithm to networks whose structure was not previously 
well-documented and find that in both cases it extracts 
clear communities which appear to correspond to plausi- 
ble and informative divisions of the network nodes. 

A number of extensions or improvements of our 
method may be possible. First, we hope to generalize 
the method to handle both weighted and directed graphs. 
Second, we hope that it may be possible to improve the 
speed of the algorithm. At present, the algorithm runs 
in time O(n^3) on sparse graphs, where n is the 
number of vertices in the network. This makes it impractical 
for very large graphs. Detecting communities in, for 
instance, the large collaboration networks or subsets of 
the Web graph that have been studied recently would 
be entirely infeasible. Perhaps, however, the basic 
principles of our approach (focusing on the boundaries of 
communities rather than their cores, and making use of 
edge betweenness) can be incorporated into a modified 
method that scales more favorably with network size. 

We hope that the ideas and methods presented here 
will prove useful in the analysis of many other types of 
networks. Possible further applications range from the 
determination of functional clusters within neural net- 
works to analysis of communities on the World-Wide 
Web, as well as others not yet thought of. We hope to 
see such applications in the future. 